In this tutorial you will take observational data and plot the resulting size spectra for the community and for individual species. This will give you a concrete understanding of what size spectra are. Size spectra can be represented in several different ways, and this is a common source of confusion. We will discuss this in detail.
TODO: Write about the origin of the size spectrum concept, with references. In particular we need to take into account that participants come from different backgrounds and therefore will have differing prior associations with the term “size-spectrum”.
To analyse and plot the data we will be making use of the tidyverse package, in particular dplyr and ggplot2. If you are not familiar with these, you can learn what is needed just by studying the example code in this tutorial. You will find explanations of the code if you expand the “R details” sections that you find below many of the code chunks. To expand an explanation section, just click on the “R details”.
library(tidyverse)
First we load the data.
data <- readRDS("size-data.rds")
str(data)
## 'data.frame': 925594 obs. of 3 variables:
## $ species: chr "Cod" "Cod" "Cod" "Cod" ...
## $ weight : num 52.77 56.37 6.14 5.66 5.89 ...
## $ length : num 17.41 17.8 8.5 8.27 8.38 ...
The readRDS() function loads the file “size-data.rds”
which contains the data frame with the size data. We assign this data
frame to the variable data for use below and then, to get
an impression of what is in the data frame we use the str()
function.
You can always easily read more information about the functions we are using by clicking on the function name. This will open the function’s help page in a new browser window.
The data consists of measurements of the length in centimetres of 925594 fish of various species. The species included are
unique(data$species)
## [1] "Cod" "Whiting" "Haddock" "Saithe" "Herring" "Sandeel"
## [7] "Nor. pout" "Plaice" "Sole"
This data was assembled by Ken Andersen at DTU Aqua in Copenhagen.
The length in centimetres \(l\) was converted to weight in
grams \(w\) using the standard
allometric relationship
\[ w = a\, l ^ b,\]
where the coefficient \(a\) and the exponent \(b\) are species-specific parameters (we’ll discuss how to find such species parameters in a later tutorial). The reason we like to work with weight as the measure of a fish’s size is that there are well-known allometric relationships between weight and physiological rates. For example metabolic rate is often taken to scale as \(w^{3/4}\) and mortality is taken to scale as \(w^{-1/4}\), as we will be discussing in the section on allometric rates in the next tutorial.
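As a minimal illustration of the length-weight conversion, here is a small helper function. The values a = 0.01 and b = 3 used below are illustrative placeholders that happen to roughly reproduce the cod rows shown above; they are not necessarily the parameters that were actually used to prepare this data set.

```r
# Convert length [cm] to weight [g] via the allometric relationship w = a * l^b.
# a = 0.01 and b = 3 are placeholder values for illustration only.
length_to_weight <- function(l, a = 0.01, b = 3) {
  a * l^b
}
length_to_weight(17.41)
## [1] 52.77106
```

Compare this with the first cod in the data frame, which has length 17.41 cm and weight 52.77 g.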
When not otherwise specified, all lengths are given in centimetres [cm] and all weights are given in grams [g].
To get an impression of the size distribution of the fish, we plot a histogram of the fish weights.
p <- ggplot(data) +
geom_histogram(aes(weight), fill = "blue", colour = "black") +
labs(x = "Weight [g]",
y = "Number of fish")
p
We have used the ggplot2 package, included in the tidyverse package, to make the plot. It is much more powerful and convenient than the base plot commands. It implements the “grammar of graphics”. If you are not already using ggplot2 it is worth your time to familiarise yourself with it. However for the purpose of this tutorial, you can simply pick up the syntax from the examples we give.
In the above we first specify with ggplot(data) that the
graph shall be based on the data frame data that we loaded
previously. We then add features to the graph with the
+.
First geom_histogram() specifies that we want a
histogram plot. The argument specifies the variable to be represented.
Note how this is wrapped in a call to aes(). Don’t ask why,
that is the way the grammar of graphics likes it. The specification of
how the variables are tied to the aesthetics of the graph will always be
given within the aes() function.
We then specify that we want the bars in the histogram to be blue
(fill = "blue") with a black border
(colour = "black"). Such tuning of the appearance is of
course totally optional. By the way: one has to admire how the ggplot2
package accepts both colour and color, so
that our US friends can use color = "black".
Then we add our own labels to the axes with labs().
We assign the resulting plot to the variable p because
that way we can manipulate the plot further below. Because the
assignment operator in R does not display any result, we added another
line with just the variable name p which will display the
plot.
The plot is not very informative. It just tells us that most fish are very small but there is a small number of very large fish. We can not see much detail. The first way to improve the plot is to plot the y-axis on a logarithmic scale. That has the effect of stretching out the small values and squashing the large values, revealing more detail.
p + scale_y_log10()
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 2 rows containing missing values (geom_bar).
We get a warning because there were bins that contained no fish, and taking the log of 0 is not allowed. We can ignore these warnings because the empty bins will simply not be given a bar in the resulting plot.
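You can reproduce the underlying problem directly in the R console:

```r
# The logarithm of zero is not a finite number, so a bin with zero
# fish has no finite position on a logarithmic y axis.
log10(0)
## [1] -Inf
```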
The second way to deal with the fact that there are so many more small fish than large fish, is to make the bin width larger at larger sizes and smaller at smaller sizes. For example we could make the smallest bin go from 1 gram to 2 gram, the next bin to go from 2 gram to 4 gram, and so on, with each next bin twice the size of the previous. Let’s create the break points between these bins:
(log_breaks <- seq(from = 0, to = 11, by = 1))
## [1] 0 1 2 3 4 5 6 7 8 9 10 11
(breaks <- 2 ^ log_breaks)
## [1] 1 2 4 8 16 32 64 128 256 512 1024 2048
We had decided that we wanted the breaks between the bins to be at
powers of 2. We first create the vector log_breaks with the
exponents. The seq() function creates a vector of numbers
starting at from and going up to to in steps
of size by. You do not need to give the names of the
arguments because they can also be identified by their position. So you
could also have written seq(0, 11, 1). Such integer
sequences are used so often that there is the even shorter notation
0:11 giving the same sequence.
Note how R is good at vectorised calculations. The second line above
creates a vector containing the powers of 2 with the exponents taken
from the entries of the vector log_breaks.
The reason we have put parentheses ( ... ) around the
assignments in the above code is because that also leads to the result
being displayed whereas the assignment operator without those
parentheses around it would not display anything.
Now we can use these logarithmically spaced bins in the histogram:
ggplot(data) +
geom_histogram(aes(weight), fill = "blue", colour = "black",
breaks = breaks) +
labs(x = "Weight [g]",
y = "Number of fish") +
scale_y_log10()
The heights are now slightly more even among the bins.
We find it very important that you break up your reading of the tutorials with some hands-on work that builds on what you have just learned. Therefore you will find exercises throughout the tutorials. You have now reached the first such exercise. You will find that it can be accomplished using the R functions you have already learned about. Please do your work in the worksheet that accompanies this tutorial. You will find that worksheet in the file “worksheet1-observed-size-spectra.Rmd” in your worksheet repository for this week. This will sound familiar to you if you have worked through the Use Git and GitHub tutorial and its associated worksheet. If you have not worked through that tutorial yet, please do so now.
If you have not yet created this week’s worksheet repository, you can do so by following this link:
https://classroom.github.com/a/5cnr1H7R
You must then clone this to your computer and open it in RStudio. In that worksheet you will find the following first exercise:
Now you want to double the number of logarithmically-sized bins used to get a more detailed picture of the size spectrum. So instead of using 11 bins to cover the range from 1g to 2048g, where each bin is twice as large as the previous, you want to use 22 bins where each bin is larger than the previous one by a factor of \(\sqrt{2}\).
Create the vector with the breaks between the bins and then use that when you plot the histogram.
After successful completion of that exercise, return here to continue with the tutorial. Do not do the other exercises until you reach them in this tutorial.
Note that the height of the bars changed as we changed how we bin the data. That is obvious. If we make a bin twice as large, we expect twice as many fish in that bin. To get rid of this dependence on the choice of the width of bins, we want to work with the density. This is obtained by dividing the height of each bar by its width.
Because it is important to understand this concept of
density, we calculate the number density by hand below,
even though ggplot has a built-in function geom_density()
that we could use instead. We first bin the data by hand, and then we
calculate and plot the densities.
To understand better what the histogram did and to improve the plots further, we bin the data ourselves. We do this by first adding a bin number to each observation, which gives the number of the bin in which the weight of the fish lies.
data_with_bins <- data |>
mutate(bin = cut(weight, breaks = breaks, right = FALSE,
labels = FALSE))
head(data_with_bins)
## species weight length bin
## 1 Cod 52.771869 17.410082 6
## 2 Cod 56.370714 17.797179 6
## 3 Cod 6.142102 8.500393 3
## 4 Cod 5.656656 8.270275 3
## 5 Cod 5.885482 8.380322 3
## 6 Cod 5.917816 8.395640 3
We used the pipe operator |> that simply pipes the
output of the code preceding it into the first argument of the function
following it. So the above code is equivalent to
data_with_bins <- mutate(data, bin = cut(weight, breaks = breaks, right = FALSE,
labels = FALSE))
The pipe operator becomes really useful only if you do a longer sequence of operations on data. You will see examples of its use later.
The mutate() function can add new columns to a data
frame or modify existing columns. In the above example it adds a new
column bin. The entries in that column are here calculated
by the function cut that returns the label of the bin into
which an observation falls. We specify the bin boundaries with the
breaks = breaks to be the boundaries we have calculated
above. The right = FALSE means that in case an observation
falls exactly on a right bin boundary, it is not included in that bin
but instead in the next bin. The labels = FALSE means that
the bins are not labelled by the intervals but simply by integer
codes.
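A tiny example with made-up weights and breaks shows what cut() returns with these arguments:

```r
# Bins are [1,2), [2,4), [4,8) because right = FALSE.
# labels = FALSE returns the integer bin code for each observation.
toy_breaks <- c(1, 2, 4, 8)
toy_weights <- c(1.5, 2, 3.9, 7)
cut(toy_weights, breaks = toy_breaks, right = FALSE, labels = FALSE)
## [1] 1 2 2 3
```

Note how the weight 2, which lies exactly on a bin boundary, is assigned to the second bin, not the first, because of right = FALSE.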
We then group the data by bin and calculate the number of fish in each bin:
binned_numbers <- data_with_bins |>
group_by(bin) |>
summarise(Numbers = n())
binned_numbers
## # A tibble: 11 x 2
## bin Numbers
## <int> <int>
## 1 1 223609
## 2 2 188847
## 3 3 279646
## 4 4 132316
## 5 5 53082
## 6 6 17039
## 7 7 14238
## 8 8 15061
## 9 9 1570
## 10 10 159
## 11 11 27
After we have grouped together all the observations with the same bin
number with the group_by(bin), the summarise()
function creates a new data frame with one row for each group, which in
this case means one row for each bin. That data frame will always have
one column specifying the group and then we specified that we want an
extra column Numbers that just counts the number of
observations in the group with the n() function. Note that
the species is ignored in this calculation.
In the above code you see the pipe operator |> being
quite convenient, because it allows us to write the functions in the
order in which they are applied, rather than having to write
summarise(group_by(...)).
The numbers in each bin give us the heights of the histogram above.
The values for Numbers of course depend on the size of
the bins we have chosen. Wider bins will have more fish in them. It is
therefore convenient to divide these numbers by the bin widths to get
the average number density in the bins.
bin_width <- diff(breaks)
binned_numbers <- binned_numbers |>
mutate(Number_dens = Numbers / bin_width[bin])
We first calculate the widths of the bins using the
diff() function which calculates the difference between
neighbouring entries in a vector. Then when we calculate the entries for
the new Number_dens column we pick out the bin width
appropriate to the given bin for each observation.
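The subsetting expression bin_width[bin] is worth pausing over: indexing a vector with a vector of bin numbers looks up one width per row, all in one vectorised operation. A small standalone example:

```r
# Widths of bins 1, 2 and 3.
widths <- c(1, 2, 4)
# The bin number of each of four observations.
bins <- c(3, 1, 3, 2)
# Indexing with the vector 'bins' picks out the matching width for each row.
widths[bins]
## [1] 4 1 4 2
```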
Let’s make a plot of the number density against weight. Of course we don’t know the number density for all weights because we discretised the continuous weight variable into bins. We only have an average value for the number density in each bin, which we use to determine the height of the curve at the midpoint of the bin. The plot interpolates between these discrete points by straight lines to produce a continuous curve.
mid_points <- 2 ^ (log_breaks + 1/2)
ggplot(binned_numbers) +
geom_line(aes(x = mid_points[bin], y = Number_dens)) +
labs(x = "Weight [g]",
y = "Number density")
Again the graph tells us that most of the individuals are very small, but we can not see any of the details. We therefore plot the density on log-log axes:
ggplot(binned_numbers) +
geom_line(aes(x = mid_points[bin], y = Number_dens)) +
scale_x_log10() +
scale_y_log10() +
labs(x = "Weight [g]",
y = "Number density")
It is time for your second exercise of the course. Again go to the accompanying worksheet file in RStudio, where you will find the following exercise:
Earlier you increased the number of bins from 11 to 22. Because the same number of observed fish was then spread over this larger number of bins, all the bars in the histogram were accordingly less high. By going to the number density we have corrected for this. The density plot created with the 22 bins will of course not look exactly the same as the one created with 11 bins. It will look more ragged because it exposes the noise in the data more.
Create the plot of the number density using the 22 logarithmically sized bins from exercise 1.
The number density in the above log-log plot is described approximately by a straight line. We can approximate the slope and intercept of that straight line by fitting a linear model
(model <- lm(log(Number_dens) ~ log(mid_points[bin]), data = binned_numbers))
##
## Call:
## lm(formula = log(Number_dens) ~ log(mid_points[bin]), data = binned_numbers)
##
## Coefficients:
## (Intercept) log(mid_points[bin])
## 14.563 -2.241
This tells us that the straight-line approximation has a slope of about -2.24. We can also ask ggplot to put this line into the plot, together with its 95% confidence interval:
ggplot(binned_numbers, aes(x = mid_points[bin], y = Number_dens)) +
geom_line() +
scale_x_log10() +
scale_y_log10() +
labs(x = "Weight [g]",
y = "Number density") +
geom_smooth(method = 'lm')
## `geom_smooth()` using formula 'y ~ x'
The linear regression line is produced by
geom_smooth(method='lm'). Note how we moved the call to
aes() into the call to ggplot(). That is then
used automatically for both the geom_line() and the
geom_smooth(), so that we did not have to specify the
information twice.
If we denote the number density at weight \(w\) by \(N(w)\), then the above tells us that \[\log(N(w)) \approx 14.6 - 2.24 \log(w).\]
If we exponentiate both sides to get rid of the logarithms this gives
\[ N(w) \approx \exp(14.6) w^{-2.24} = N(1) w^{-\lambda} \]
with \(\lambda \approx 2.24\).
A straight line on a log-log plot indicates a power-law relationship between the variables with the slope of the line being the exponent in the power law.
Of course the approach we took above of estimating the exponent in the power law from the binned data is not ideal. If one has access to unbinned data, as we have here, one should always use that unbinned data for inference. So the better way to infer the exponent from our data would be to ask: “If we view our set of observed fish sizes as a random sample from the population described by the power law, for which exponent would our observations be the most likely”. In other words, we should do a maximum likelihood estimation of the exponent. We’ll skip the mathematical details and just tell you the result that the maximum likelihood estimate for the power-law exponent \(\lambda\) is \[\lambda = 1+\frac{n}{\sum_{i=1}^n\log\frac{w_i}{w_{min}}},\] where \(n\) is the number of observed fish, \(w_i\) is the weight of the \(i\)th fish and \(w_{min}\) is the weight of the smallest fish. For our data this gives
w_min = min(data$weight)
lambda <- 1 + nrow(data) / sum(log(data$weight / w_min))
lambda
## [1] 1.709584
So this approach, which gives equal weight to each observation rather than giving equal weight to each bin, gives a lower value for \(\lambda\), namely \(\lambda \approx 1.71\) instead of \(\lambda \approx 2.24\). But this is still not the end of the story, because we did not take measurement error into account. We assumed that we sample perfectly. But in reality, small individuals are much easier to miss than large ones, so our data is almost certainly under-reporting the number of small individuals, which leads to a smaller \(\lambda\).
A good approach is to try to observe the power law over a wider range of sizes, all the way from bacteria to whales. This is what Sheldon et al. did in 1972, and they observed that \(\lambda \approx 2\). More thorough investigations since then have led to values just slightly above \(2\) on average. TODO: references.
Above we first calculated the number of fish in each weight bin and then divided by the width of each bin to obtain the average number density in each bin. Exactly analogous to that, we can calculate the biomass of all the fish in each weight bin and divide that by the width of each bin to obtain the average biomass density in each bin.
binned_biomass <- data_with_bins |>
group_by(bin) |>
summarise(Biomass = sum(weight)) |>
mutate(Biomass_dens = Biomass / bin_width[bin])
Make a plot of the biomass density similarly to how we plotted the number density above, using logarithmic axes, and also plot the straight-line approximation.
Fitting a linear model to the binned biomass data gives
(model <- lm(log(Biomass_dens) ~ log(mid_points[bin]), data = binned_biomass))
##
## Call:
## lm(formula = log(Biomass_dens) ~ log(mid_points[bin]), data = binned_biomass)
##
## Coefficients:
## (Intercept) log(mid_points[bin])
## 14.615 -1.265
The slope is -1.265 for the biomass density whereas it was -2.24 for the number density. This makes perfect sense, because if we denote the biomass density by \(B(w)\) then \(B(w) = w N(w)\) and hence if \(N(w) \propto w^{-\lambda}\) then \(B(w)\propto w^{1-\lambda}\).
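We can check this relationship numerically with the two fitted slopes: adding 1 to the number-density slope should land close to the biomass-density slope. The two do not match exactly because the two regressions weight the binned data differently.

```r
# Fitted slopes copied from the regression output above.
slope_N <- -2.241  # number density
slope_B <- -1.265  # biomass density
# If B(w) = w * N(w), the biomass slope should be the number slope plus 1.
slope_N + 1
## [1] -1.241
```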
There is perhaps something a bit strange about the way we visualised the size spectrum via the number density or the biomass density plotted on logarithmic axes.
Above we calculated densities by dividing the total number or total biomass in each bin by the width that the bin has on the \(w\) axis. That is fine. Then we decided to use a logarithmic axis that showed \(\log(w)\). But then would it not have been more natural to calculate the densities by dividing by the width of the bin on that logarithmic axis?
Well, the answer to that is probably a matter of taste. Different people make a different choice. It is very important to be aware of this to understand discrepancies between reported power-law exponents: they depend on the choice made.
The result of dividing the total biomass in each bin by the width of the bin on the logarithmic weight axis is sometimes called the Sheldon spectrum, because it is this density that is approximately constant. Below we plot the biomass density (in black) and the Sheldon density (in blue) on the same axes.
log_bin_width <- diff(log_breaks)
binned_biomass <- binned_biomass |>
mutate(Sheldon_dens = Biomass / log_bin_width[bin])
ggplot(binned_biomass) +
geom_line(aes(x = mid_points[bin], y = Biomass_dens),
colour = "black") +
geom_line(aes(x = mid_points[bin], y = Sheldon_dens),
colour = "blue") +
scale_x_log10() +
scale_y_log10() +
labs(x = "Log Weight",
y = "Density")
We will refer to Sheldon’s density as “the biomass density in log weight” and notate it as \(B_{\log w}(w)\). Similarly we introduce “the number density in log weight” \(N_{\log w}(w)\).
You will notice the reduced slope. In fact the slope of the Sheldon density is less negative than the slope of the biomass density. We have the following relations among the various densities:
\[B_{\log w}(w) = w\, B(w) = w\, N_{\log w}(w) = w^2\, N(w).\]
Sheldon’s observation was that \(B_{\log w}(w)\) is approximately constant over a large range of sizes \(w\) from bacteria to whales. That corresponds to the earlier phrasing of Sheldon’s observation that \(N(w)\propto w^{-\lambda}\) with \(\lambda\) close to 2.
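A quick synthetic check makes this concrete. For an idealised power law with \(\lambda = 2\), the Sheldon density \(B_{\log w}(w) = w^2 N(w)\) is exactly constant:

```r
# Weights spanning 1 g to 1000 g.
w <- 10^seq(0, 3, length.out = 5)
# Idealised number density with exponent lambda = 2.
N <- w^-2
# Sheldon density: B_logw(w) = w^2 * N(w), constant when lambda = 2.
B_logw <- w^2 * N
B_logw
## [1] 1 1 1 1 1
```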
Binning the data is not the only way to approximate the densities.
One can also use the kernel density estimation method. ggplot2 even has
that built in to its geom_density(). Here is a plot of the
number density in log weight \(N_{\log
w}(w)\) estimated by the kernel density method:
ggplot(data) +
geom_density(aes(weight, stat(count)), adjust = 4) +
scale_x_log10() +
scale_y_log10() +
labs(x = "Weight [g]",
y = "Number density in log w")
By default, geom_density() would normalise the density
so that the integral under the density curve is 1. We use
stat(count) to tell it that we want the number density, so
that the integral under the curve is the total number, not normalised to
1.
The kernel density method works by placing a small Gaussian (bell
curve) at each data point and then adding up all these little Gaussians.
The adjust = 4 means that the width of these Gaussians is 4
times wider than geom_density() would choose by default.
This leads to a smoother density estimate curve. This plays a similar
role as the bin width does in histograms.
So far we have looked at the community spectrum, where we ignored the species identity of the fish. We will now look at the spectra of the individual species. We’ll plot them all on the same graph and display them with plotly so that you can hover over the resulting graph to see which line corresponds to which species. We also include the community size spectrum in black for comparison.
p <- ggplot(data) +
geom_density(aes(weight, stat(count), colour = species), adjust = 4) +
geom_density(aes(weight, stat(count)), colour = "black", adjust = 4) +
scale_x_continuous(trans = "log10") +
scale_y_continuous(trans = "log10", limits = c(1, NA)) +
labs(x = "Weight [g]",
y = "Number density in log w")
plotly::ggplotly(p)
## Warning: Transformation introduced infinite values in continuous y-axis
Note how easy it was to ask ggplot to draw a separate line for each
species. We only had to add the colour = species to
geom_density(). ggplot automatically chose some colours for
the species and added a legend to the plot.
We have replaced scale_x_log10() by
scale_x_continuous(trans = "log10") which does exactly the
same thing. But the latter form is more flexible, which we used in
scale_y_continuous(trans = "log10", limits = c(1, NA)) to
also specify limits for the y axis. We chose to plot densities above 1
only because at smaller values there is only noise. We did not want to
set an upper limit, hence the NA.
We then did not display the plot p but instead fed it to
plotly::ggplotly(). The ggplotly() function
takes a plot created with ggplot and converts it into a plot created
with plotly. plotly plots are more interactive. We called the
ggplotly() function with plotly::ggplotly()
because that way we did not need to load the plotly package first, i.e.,
we did not need to do library(plotly).
Don’t worry about the fact that in places the estimate of the number density for the entire community is below the estimate of the number density for individual species. That just shows that we are dealing only with estimates of the true densities based on a finite amount of data.
There are several messages you should take away from this plot.
We will discuss species size spectra more in the next tutorial, where we will look at them with the help of the mizer model.
Now make a size-spectrum plot for a new data set. If you happen to have access to size data yourself, you could use that. Otherwise use … TODO: provide data set.
Summary from Asta

1. It is very useful to know how many organisms of different sizes there are. This is what size spectra show.
2. Yet, we can measure sizes in different ways. The key thing to understand is how we group different sizes into bins. The width of these bins determines how many individuals we find in them. A bin from 1 to 2g will have fewer individuals than a bin from 1 to 10g.
3. To avoid the effect of variable bin sizes we use normalised densities, where the number of individuals in each bin is divided by the width of the bin. This is called the number density spectrum.
4. When we group observed individuals into bins, for each bin we can calculate either the total number of observations or the total weight of all individuals in that bin. This is NOT the same. There will be a lot of very small things, so the numbers in a small-sized bin will be large, but their total weight will not be high. This means that we can talk either about abundance size spectra or biomass size spectra. The abundance size spectra will be steeper, because there will be lots of small things and few large things. The biomass size spectra will be much flatter, because the normalised biomass is about the same.